Abstract: Subspace models play an important role in a wide
range of signal processing tasks, and this paper explores how
the pairwise geometry of subspaces influences the probability of
misclassification. When the mismatch between the signal and the
model is vanishingly small, the probability of misclassification is
determined by the product of the sines of the principal angles
between subspaces. When the mismatch is more significant, the
probability of misclassification is determined by the sum of
the squares of the sines of the principal angles. Reliability of
classification is derived in terms of the distribution of signal
energy across principal vectors. Larger principal angles lead
to smaller classification error, motivating a linear transform
that optimizes principal angles. The transform presented here
(TRAIT) preserves some specific characteristic of each individual
class, and this approach is shown to be complementary to a
previously developed transform (LRT) that enlarges inter-class
distance while suppressing intra-class dispersion. Theoretical
results are supported by demonstration of superior classification
accuracy on synthetic and measured data even in the presence
of significant model mismatch.

Index Terms: subspace, classification, SNR


I. Introduction 

Signals that are nominally high dimensional often exhibit
a low dimensional geometric structure. For example,
fixed-pose images of human faces are recorded using more
than 1000 pixels, but can be represented by a 9-dimensional
harmonic subspace [?]. Motion trajectories of a rigid body
might be recorded by hundreds of sensors, but must intrinsically
be represented by a 4-dimensional subspace [?]. There
are many more examples where a low-dimensional subspace
model captures intrinsic geometric structure, ranging from user
ratings in a recommendation system [?] to signals emitted
by multiple sources impinging at an antenna array [?]. The
subspace geometry has assisted tasks of interest to both signal
processing [?], [?] and machine learning communities [?], [?].

A Gaussian Mixture Model (GMM) measures proximity to
a union of linear or affine subspaces, by imposing a low-rank
structure on the covariance of each mixture component. It
can be used to approximate a nonlinear manifold by fitting
mixture components to local patches of the manifold [?],
[?], hence providing a high fidelity representation of a wide
variety of signal geometries. The simplicity of the model
facilitates signal reconstruction [10]-[13], making GMMs a
very attractive signal source model in compressed sensing.
The value of low-rank GMMs extends to classification, where
each class is modeled as a low-rank mixture component, and
classes are identified by their projections onto linear features.
Optimal feature design is addressed in [?], [?].

The GMM is usually only an approximation to the truth. 
For example, the full spectrum associated with a face image 
follows a power law distribution, and when we truncate to 
the first 9 harmonic dimensions, the residual energy will be a 
source of error in classification. Even if the true model were a 
GMM, we can only learn an approximation to the true model 
from training data. The more data we see, the better is the 
fit of our empirical model, but some degree of mismatch is 
unavoidable. If we treat this mismatch as a form of noise, then
we can use information theory to derive fundamental limits on
the number of classes that can be discerned (see [?] for more
details).

This paper explores how the pairwise geometry of subspaces
influences the probability of misclassification. There are
parallels with non-coherent wireless communication [?], where
information is encoded as a subspace drawn from a fixed
alphabet, and the function of the receiver is to distinguish
the transmitted subspace. When each component is perfectly
modeled as a Gaussian, the performance of the MAP classifier
can be analyzed using the Chernoff bound [17]. When fidelity
is perfect, there is no mismatch, and fundamental limits on
performance are determined by the rank of the intersection of
the classes [?], [?].

In this paper, we further consider how best to discriminate 
classes, when the alignment between the GMM model and the 
data is only approximate. We make three main contributions 
in this paper: 

1) We express the probability of pairwise misclassification
in terms of the principal angles between the corresponding
subspaces. This expression depends on the
mismatch between the signal and the model. Interpreting
this mismatch as noise, we provide an analysis of the low,
moderate, and high SNR regimes. This improves upon
[18], in the sense that we have a more explicit expression
of the “measurement gain” as proposed in [?].

2) We characterize the probability of misclassification for
more general distributions near subspaces. This is motivated
by the case where the training samples per class
are insufficient for a reliable estimate of the covariance. In
these cases, we have very little knowledge about the
signal’s distribution and a MAP classifier is not a good
fit. The Nearest Subspace Classifier (NSC) provides an
alternative, and we use the NSC rather than the
MAP classifier to bound the probability of misclassification.

3) We develop a feature extraction method, TRAIT, that
effectively enlarges principal angles between different
subspaces and preserves intra-class structure. We demonstrate
superior classification accuracy on synthetic and
measured data, particularly in the presence of significant
model mismatch.

This paper is organized as follows. Section II presents
the subspace geometry framework. Section III analyzes the
Maximum a Posteriori (MAP) classifier under the GMM
assumption. Section IV analyzes the performance of the Nearest
Subspace Classifier (NSC), which relaxes the GMM assumption.
Section V proposes a feature extraction method, TRAIT,
that exploits subspace geometry, and presents experimental
results for both synthetic and measured datasets. Section VI
provides a final summary.

A note on notation: we use bold upper case letters for
matrices, e.g., $X$, and bold lower case letters for vectors, e.g.,
$x$. The transpose of a matrix $X$ is denoted by $X^T$. Scalars are
written as plain letters, e.g., $\lambda$, $K$.


II. Geometric Framework 

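Consider two subspaces $\mathcal{X}$ and $\mathcal{Y}$ of $\mathbb{R}^n$ with dimensions $\ell$
and $s$ respectively, where $\ell \le s$. The principal angles between
$\mathcal{X}$ and $\mathcal{Y}$, denoted as $\theta_1, \dots, \theta_\ell$, are defined recursively as
follows:

$$\theta_1 = \min_{x_1 \in \mathcal{X},\, y_1 \in \mathcal{Y}} \arccos\left(\frac{x_1^T y_1}{\|x_1\|\,\|y_1\|}\right),$$

$$\theta_j = \min_{\substack{x_j \in \mathcal{X},\, y_j \in \mathcal{Y} \\ x_j \perp x_1, \dots, x_{j-1} \\ y_j \perp y_1, \dots, y_{j-1}}} \arccos\left(\frac{x_j^T y_j}{\|x_j\|\,\|y_j\|}\right), \quad j = 2, \dots, \ell.$$

The vectors $x_1, \dots, x_\ell$ and $y_1, \dots, y_\ell$ are called principal
vectors. The dimension of $\mathcal{X} \cap \mathcal{Y}$ is the multiplicity of zero as
a principal angle. It is straightforward to compute the principal
angles by calculating the singular values of $X^T Y$, where $X$
and $Y$ are orthonormal bases for $\mathcal{X}$ and $\mathcal{Y}$ respectively. The
singular values of $X^T Y$ are then $\cos\theta_1, \dots, \cos\theta_\ell$.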
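
The SVD recipe above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the paper: the helper name `principal_angles` and the toy planes in $\mathbb{R}^3$ are assumptions made for the example, and the QR step is added only so that the inputs need not be orthonormal already.

```python
import numpy as np

def principal_angles(X, Y):
    """Principal angles (radians, ascending) between the column spans of X and Y."""
    # Orthonormalize each basis with a reduced QR factorization.
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    # The singular values of Qx^T Qy are the cosines of the principal angles.
    cosines = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    # Clipping guards against round-off slightly outside [0, 1].
    return np.arccos(np.clip(cosines, 0.0, 1.0))

# Hypothetical example: two planes in R^3 that share the first coordinate axis.
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
Y = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
angles = principal_angles(X, Y)
print(angles)  # one zero angle (shared direction) and one right angle
```

Because singular values are returned in descending order, the angles come out in ascending order, so zero angles (the intersection) appear first.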

Let $\ell = s$. The principal angles induce several distance
metrics on the Grassmann manifold, of which the most widely
used is the (squared) chordal distance $D_c^2(\mathcal{X}, \mathcal{Y})$ given
by

$$D_c^2(\mathcal{X}, \mathcal{Y}) = \sum_{i=1}^{s} \sin^2\theta_i.$$

The chordal distance is an aggregate, and in the following
sections we will see how the probability of misclassification
depends, not so much on this aggregate, but on the individual
principal angles.

III. The MAP Classifier for a GMM

We begin by considering the MAP classifier, which is
optimal when the signal distribution is known. We focus on
binary classification, where the two classes are equiprobable,
since the generalization from two to many classes is well
understood [?], [?].

We model each class as zero mean Gaussian distributed,
where the covariance is near low-rank. Classification can be
formulated as the following binary hypothesis testing problem:

$$H_1: x \sim \mathcal{N}(0, \Sigma_1), \qquad H_2: x \sim \mathcal{N}(0, \Sigma_2).$$ (1)


We justify the zero-mean assumption by observing that in
applications such as face recognition [?], or motion trajectory
segmentation [?], the actual mean is considered as a nuisance
parameter, and is removed prior to processing. Given the
near-subspace assumption, we model the two covariances as

$$\Sigma_1 = U_1 \Lambda_1 U_1^T + \sigma^2 I, \qquad \Sigma_2 = U_2 \Lambda_2 U_2^T + \sigma^2 I,$$ (2)

where $U_1, U_2 \in \mathbb{R}^{n \times d}$ are the orthonormal bases for the two
signal subspaces, denoted by $\mathcal{X}_1$ and $\mathcal{X}_2$. Typically $n \gg d$.
$\Lambda_1, \Lambda_2 \in \mathbb{R}^{d \times d}$ are diagonal matrices of eigenvalues. We
assume that the two subspaces have the same dimension $d$,
and that the diagonal elements of $\Lambda_1$, $\Lambda_2$ are arranged in
descending order. In the application to motion trajectories we
take $d = 4$, and in the application to face recognition we might
take $d = 9$. Denote the $i$-th largest eigenvalue of $\Lambda_j$ by $\lambda_{j,i}$.
Finally let $\sigma^2$ be the variance of the noise, which quantifies
the degree of mismatch between the subspace model and the
data.

Denote the probability of mistaking hypothesis 2 for hypothesis 1
by $\Pr(H_2|H_1)$, and define $\Pr(H_1|H_2)$ similarly. Under
the assumption that the two hypotheses are equiprobable, the
error probability $P_e$ of a MAP (optimal) classifier is

$$P_e = \frac{1}{2}\left[\Pr(H_2|H_1) + \Pr(H_1|H_2)\right] = \frac{1}{2}\int \min\big(\Pr(x|H_1), \Pr(x|H_2)\big)\,dx.$$ (3)

Since this integral does not admit a closed form solution, we
study the Bhattacharyya upper bound [22] instead. This
bound is a special case of the Chernoff bound [17] derived
using the observation $\min(a, b) \le \sqrt{ab}$. The Bhattacharyya
bound gives

$$P_e \le \frac{1}{2} e^{-K}, \quad \text{where } K = \frac{1}{2} \ln \frac{\det\left(\frac{\Sigma_1 + \Sigma_2}{2}\right)}{\sqrt{\det \Sigma_1 \det \Sigma_2}}.$$ (4)

The numerator inside the logarithm measures the volume
of space occupied by both subspaces together, and the
denominator measures the volumes occupied separately. These
quantities depend on the principal angles, and we now study
the performance of the Bhattacharyya bound in the high, low
and moderate SNR regimes.
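
To make the bound concrete, the following sketch evaluates $\frac{1}{2}e^{-K}$ for two near-low-rank covariances of the form $U\Lambda U^T + \sigma^2 I$. It is an illustration under stated assumptions rather than the authors' code: the function names, the 4-dimensional toy subspaces, and the choice $\Lambda = I$ are all hypothetical.

```python
import numpy as np

def bhattacharyya_bound(S1, S2):
    """Upper bound 0.5 * exp(-K) on the MAP error for N(0, S1) vs N(0, S2)."""
    # slogdet returns (sign, log|det|) and avoids overflow in high dimensions.
    _, ld_avg = np.linalg.slogdet(0.5 * (S1 + S2))
    _, ld1 = np.linalg.slogdet(S1)
    _, ld2 = np.linalg.slogdet(S2)
    K = 0.5 * (ld_avg - 0.5 * (ld1 + ld2))
    return 0.5 * np.exp(-K)

def near_low_rank_cov(U, lam, sigma2):
    """Covariance U diag(lam) U^T + sigma2 * I for an orthonormal basis U."""
    return U @ np.diag(lam) @ U.T + sigma2 * np.eye(U.shape[0])

# Hypothetical toy setup: two planes in R^4 sharing one direction (d = 2, r = 1).
n = 4
U1 = np.eye(n)[:, :2]    # span{e1, e2}
U2 = np.eye(n)[:, 1:3]   # span{e2, e3}
lam = np.ones(2)
bounds = {}
for sigma2 in (1e-1, 1e-2, 1e-3):
    bounds[sigma2] = bhattacharyya_bound(near_low_rank_cov(U1, lam, sigma2),
                                         near_low_rank_cov(U2, lam, sigma2))
    print(sigma2, bounds[sigma2])
```

In this toy case the bound shrinks as the mismatch $\sigma^2$ decreases, and it equals $\frac{1}{2}$ when the two covariances coincide, since then $K = 0$.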


A. The High SNR Regime 

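We first consider the case when $\sigma^2 \to 0$, which means
that the mismatch between the signal and the model becomes
vanishingly small. Since the intersection $\mathcal{X}_1 \cap \mathcal{X}_2$ between
the two subspaces plays a special role, we write the two
covariances as

$$\Sigma_1 = U_{1,\cap} \Lambda_{1,\cap} U_{1,\cap}^T + U_{1,\setminus} \Lambda_{1,\setminus} U_{1,\setminus}^T + \sigma^2 I,$$
$$\Sigma_2 = U_{2,\cap} \Lambda_{2,\cap} U_{2,\cap}^T + U_{2,\setminus} \Lambda_{2,\setminus} U_{2,\setminus}^T + \sigma^2 I.$$ (5)

Here $r$ denotes the dimension of $\mathcal{X}_1 \cap \mathcal{X}_2$. Both $U_{1,\cap} \in \mathbb{R}^{n \times r}$
and $U_{2,\cap} \in \mathbb{R}^{n \times r}$ span $\mathcal{X}_1 \cap \mathcal{X}_2$,
with singular values $\Lambda_{1,\cap}$ and $\Lambda_{2,\cap}$ respectively. $U_{1,\setminus} \in \mathbb{R}^{n \times (d-r)}$
spans $\mathcal{X}_1 \setminus \mathcal{X}_2$ with singular values $\Lambda_{1,\setminus}$, and $U_{2,\setminus} \in \mathbb{R}^{n \times (d-r)}$
spans $\mathcal{X}_2 \setminus \mathcal{X}_1$ with singular values $\Lambda_{2,\setminus}$.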

The following theorem bounds the classification error in the
high SNR regime.

Theorem 1. Assume $n > 2(d - r)$. As $\sigma^2 \to 0$, the
classification error is upper bounded as

$$P_e \le c_1 \left(\sigma^2\right)^{\frac{d-r}{2}} \left[\prod_{i=r+1}^{d} \sin^2\theta_i\right]^{-\frac{1}{2}} + o\left(\left(\sigma^2\right)^{\frac{d-r}{2}}\right).$$

The constant $c_1$ is a function of the eigenvalues $\lambda_{1,\cap,i}$ and
$\lambda_{2,\cap,i}$ ($i = 1, \dots, r$), of the eigenvalues $\lambda_{1,\setminus,i}$ and $\lambda_{2,\setminus,i}$
($i = 1, \dots, d-r$), and of the pseudo-determinant of a matrix formed
from $U_{1,\cap}\Lambda_{1,\cap}U_{1,\cap}^T$ and $U_{2,\cap}\Lambda_{2,\cap}U_{2,\cap}^T$,
where pdet denotes the pseudo-determinant.

Proof. The method is to expand the Bhattacharyya bound
in terms of principal angles, and the details are provided in
Appendix A. □

Remark 1. 1) Typically $n \gg d$ for measured data, so the
condition $n > 2(d - r)$ is usually satisfied.

2) The classification error is upper bounded by a quantity of
order $(\sigma^2)^{(d-r)/2}$:
the smaller the overlap between subspaces, the easier it
is to discriminate between classes. When two subspaces
overlap completely, there is an error floor.

There is a duality between the GMM classification problem
and multiple antenna communication [23]. In multiple antenna
communications, a codeword is a $d \times n$ array, where the rows
are indexed by transmit antennas, the columns are indexed by
time slots in a data frame, and the entries are the symbols to
be transmitted. The probability of mistaking codeword $C_i$ for
codeword $C_j$, $\Pr(i \to j)$, satisfies

$$\Pr(i \to j) \le \left(\frac{1}{\mathrm{SNR}}\right)^{k} \left(\frac{1}{\lambda_1^2 \cdots \lambda_k^2}\right),$$

where $k$ is the rank of $C_i - C_j$, whose singular values are
$\lambda_1, \dots, \lambda_k$. The primary objective in code design for multiple
antenna wireless communication is to maximize the minimum
rank of the difference between distinct codewords. If the
minimum rank is $k$, the code is said to achieve a diversity
gain of $k$.

An important secondary objective in code design for multiple
antenna wireless communication is to maximize the minimum
product of the singular values of the difference between
distinct codewords. This minimum product determines the
coding gain.

The counterpart of coding gain in classification is the product
of sines of the principal angles. This quantity determines
the intercept of the error exponent with the vertical axis. The
smaller the energy in the intersection of the subspaces, the
smaller is the classification error. The larger the principal
angles, the smaller is the classification error.


B. The Low SNR Regime

This is the case where the noise variance $\sigma^2$ and the
singular values are commensurable; in other words, the mismatch
between the signal and the empirical model cannot be
neglected. The MAP classifier in this case is characterized by
the following theorem.

Theorem 2. When $\sigma^2$ is sufficiently large, the Bhattacharyya
upper bound is sandwiched between two expressions $P_e^{lo}$ and
$P_e^{up}$, where $P_e^{up} > P_e^{lo}$. Both are exponentials in which a
constant ($c_2$ in one, $c_3$ in the other) is offset by a weighted sum
of the terms $\lambda_{1,i}\lambda_{2,i}\cos^2\theta_i$, $i = 1, \dots, d$.

Proof. The details are given in Appendix B. □

Remark 2. The dimension of the overlap between the two
subspaces plays a less important role in the low SNR regime,
and classification error is a function of chordal distance. This
gives rise to an interesting duality between GMM model based
classification and space-time decoding [?], where error
probability is influenced by product or sum diversity in the high
or low SNR regime respectively.


C. The Moderate SNR Regime

We now consider a moderate noise/mismatch regime, where
$\frac{p}{c(p)} \le \frac{\lambda_{j,i}}{\sigma^2} \le p$ for $j = 1, 2$ and $i = 1, \dots, d$, with $p > 1$ and $c(p) > 1$.
Moderate SNR also implies that $p$ is not very large.

The most important element in the analysis of classification
error is to lower bound the term $\ln\det\left(\frac{\Sigma_1 + \Sigma_2}{2}\right)$ in Eq. (4):

$$\ln\det\left(\frac{\Sigma_1 + \Sigma_2}{2}\right) = \ln\det\left(\frac{U_1\Lambda_1U_1^T + U_2\Lambda_2U_2^T}{2\sigma^2} + I\right) + n\ln\sigma^2.$$

Denote the non-zero singular values of $D = \frac{1}{2\sigma^2}\left(U_1\Lambda_1U_1^T + U_2\Lambda_2U_2^T\right)$
by $\lambda_1, \dots, \lambda_{2d-r}$. Then

$$\ln\det\left(\frac{\Sigma_1 + \Sigma_2}{2}\right) = \sum_{i=1}^{2d-r} \ln(1 + \lambda_i) + n\ln(\sigma^2).$$ (6)

The following lemma provides a lower bound on $\ln(1 + \lambda_i)$.

Lemma 1. There exists $0 \le L < \frac{p}{2}$ such that for any
$\lambda_i \in [L, p]$,

$$\ln(1 + \lambda_i) \ge \ln(1 + p) + \frac{\lambda_i - p}{1 + p} - \frac{(\lambda_i - p)^2}{(1 + p)^2}.$$ (7)

Proof. See Appendix C. □

Let $L(p)$ be the smallest possible value of $L$, and define $c(p) = \frac{p}{2L(p)}$
if $L(p) > 0$ and $c(p) = +\infty$ if $L(p) = 0$. Note that
$c(p) > 1$ since $L(p) < \frac{p}{2}$.

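Theorem 3. If $\frac{p}{c(p)} \le \frac{\lambda_{j,i}}{\sigma^2} \le p$, then the classification
error is upper bounded as

$$P_e \le \frac{1}{2}\exp\left\{-c_4(2d - r) + \dots + c_5\right\},$$

where $c_4$ is a function of $p$ alone (built from $\frac{1}{2}\ln(1+p)$, $\frac{p}{1+p}$ and $(1+p)^2$),
and $c_5$ depends on the eigenvalues $\lambda_{1,i}$, $\lambda_{2,i}$.

Proof. See Appendix C. □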


Remark 3. It is straightforward to show numerically that
$c(p) = 3.44, 2.79$ for $p = 4, 5$ respectively, that $c(p) > 2.02$
for $p < 10$, and that $c(p) > 1.61$ for $p < 100$. The form of the
upper bound suggests that in the moderate SNR regime, the
role of chordal distance is more important than the product of
the sines of the principal angles.
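
The numerical values quoted in Remark 3 can be reproduced with a short bisection. The sketch below rests on two assumptions about the garbled statement of Lemma 1: that the inequality reads $\ln(1+\lambda) \ge \ln(1+p) + \frac{\lambda - p}{1+p} - \frac{(\lambda-p)^2}{(1+p)^2}$, and that $c(p) = p / (2L(p))$ with $L(p)$ the smallest admissible $L$. The function names are hypothetical.

```python
import math

def gap(lam, p):
    """ln(1 + lam) minus the quadratic lower bound assumed for Lemma 1."""
    q = 1.0 + p
    return math.log1p(lam) - math.log(q) - (lam - p) / q + (lam - p) ** 2 / q ** 2

def smallest_L(p, tol=1e-12):
    """Smallest L >= 0 such that gap(lam, p) >= 0 for every lam in [L, p]."""
    if gap(0.0, p) >= 0.0:
        return 0.0
    # gap increases on [0, (p - 1) / 2] and stays nonnegative on [(p - 1) / 2, p],
    # so the single sign change on [0, (p - 1) / 2] is the value we want.
    lo, hi = 0.0, (p - 1.0) / 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid, p) < 0.0:
            lo = mid
        else:
            hi = mid
    return hi

def c(p):
    """c(p) = p / (2 L(p)), or +inf when L(p) = 0."""
    L = smallest_L(p)
    return math.inf if L == 0.0 else p / (2.0 * L)

for p in (4, 5):
    print(p, round(c(p), 2))
```

Under these assumptions the computed values agree with the 3.44 and 2.79 quoted in Remark 3 to two decimal places, which supports this reading of Lemma 1.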


D. Numerical Analysis of Synthetic Data 

We explore the difference between classification in the low
and high SNR regimes through a simple numerical example.
Consider the following pairs of subspaces:

Case 1:

$$U_1 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}^T, \qquad U_2 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}^T.$$

Case 2:

$$U_1 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}^T, \qquad U_2 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 0 & 0 & -1 \\ 0 & 1 & 1 & 0 \end{bmatrix}^T.$$

We set $\Lambda_1 = \Lambda_2 = I$ for both cases. In case 1, the two
principal angles are $\theta_1 = 0$, $\theta_2 = \pi/2$, and in case 2, the
two principal angles are $\theta_1 = \theta_2 = \pi/4$. The chordal
distances in these two cases are the same, but in case 1 the
product of sines of non-zero principal angles is 1, whereas
in case 2 it is 1/2. However, there is a nontrivial intersection
dimension in case 1.
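
The geometry of the two cases can be checked numerically. The sketch below assumes the bases are read as $U_1 = [e_1, e_2]$ in both cases, $U_2 = [e_1, e_3]$ in case 1 and $U_2 = \frac{1}{\sqrt{2}}[e_1 - e_4, e_2 + e_3]$ in case 2; the helper names are hypothetical.

```python
import numpy as np

def principal_angles(U, V):
    """Principal angles (ascending) between spans of orthonormal bases U and V."""
    cosines = np.linalg.svd(U.T @ V, compute_uv=False)
    return np.arccos(np.clip(cosines, 0.0, 1.0))

def summarize(U, V, tol=1e-6):
    """Chordal distance, product of nonzero sines, and overlap dimension."""
    sines = np.sin(principal_angles(U, V))
    return {
        "chordal_sq": float(np.sum(sines ** 2)),
        "sine_product": float(np.prod(sines[sines > tol])),
        "overlap_dim": int(np.sum(sines <= tol)),
    }

U1 = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], dtype=float).T
case1_U2 = np.array([[1, 0, 0, 0], [0, 0, 1, 0]], dtype=float).T
case2_U2 = np.array([[1, 0, 0, -1], [0, 1, 1, 0]], dtype=float).T / np.sqrt(2)

case1 = summarize(U1, case1_U2)
case2 = summarize(U1, case2_U2)
print("case 1:", case1)
print("case 2:", case2)
```

Both cases have the same squared chordal distance, while the product of nonzero sines and the overlap dimension differ, which is exactly the contrast the example is built to expose.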

We vary the degree of mismatch $\sigma^2$, and evaluate the bounds
developed in the above three theorems. In the high SNR
regime, we plot the empirical misclassification probability
$P_e$ together with the value of the upper bound given in
Theorem 1. In the low SNR regime, we plot the upper bound
$P_e^{up}$ in Theorem 2. In the moderate SNR regime, we take
$p = 6$, and we vary $\sigma^2$ between $\frac{1}{p}$ and $\frac{c(p)}{p}$ so that
$\frac{p}{c(p)} \le \frac{\lambda_{j,i}}{\sigma^2} \le p$. We then plot the upper bound in Theorem 3
against the empirical classification error. In the high SNR
regime (Fig. 1), the classification error decays faster in Case
2 than in Case 1, consistent with Theorem 1. In the low SNR
regime (Fig. 1), there is little difference in classification error
between the two cases, consistent with Theorem 2. In the
moderate SNR regime (Fig. 1b), classification performance in
case 1 is inferior to that in case 2, because there is a shared
1-dimensional subspace, and this is predicted by Theorem 3.

Fig. 1. Error probability as a function of the degree of mismatch. Dashed
lines represent empirical estimates, and solid lines represent upper bounds. In
the low SNR regime the two upper bounds coincide.

Concluding this section, we have characterized the pairwise
classification error using the principal angles between a pair of
subspaces. The union bound then makes it possible to derive
an upper bound on classification error for multiple classes.


IV. Nearest Subspace Classifier: extending GMM 

If the class distribution is known (for example through its
covariance) then the MAP classifier is optimal. If however we
only know that each class is near a known low-dimensional
subspace (possibly inferred from less training data) then we
can substitute a Nearest Subspace Classifier (NSC) for the
MAP. This section connects performance of the NSC with
principal angles, and for simplicity we focus on discriminating
pairs of classes, given that the extension to multiple classes is
straightforward.

Consider two classes, labeled $C_1$ and $C_2$, distributed near
two subspaces with orthonormal bases $U_1, U_2 \in \mathbb{R}^{n \times d}$. The
NSC determines the class label of a test sample $x$, $\hat{C}$, by
comparing the norms of the projections onto $U_1$ and $U_2$:

$$\hat{C} = \begin{cases} C_1 & \text{if } \|U_1^T x\|^2 > \|U_2^T x\|^2, \\ C_2 & \text{otherwise.} \end{cases}$$ (8)

The preferred class label has a basis that is better aligned to
the signal.
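
The decision rule is short enough to state as code. The following is a minimal sketch of the two-class rule only; the function name `nsc_binary` and the toy bases in $\mathbb{R}^4$ are assumptions made for illustration.

```python
import numpy as np

def nsc_binary(x, U1, U2):
    """Return 1 if x is closer to span(U1), otherwise 2.

    U1 and U2 must have orthonormal columns, so ||U^T x||^2 is the energy
    of the projection of x onto span(U).
    """
    energy1 = np.sum((U1.T @ x) ** 2)
    energy2 = np.sum((U2.T @ x) ** 2)
    return 1 if energy1 > energy2 else 2

# Hypothetical example: two orthogonal planes in R^4.
n = 4
U1 = np.eye(n)[:, :2]
U2 = np.eye(n)[:, 2:]
x = U1 @ np.array([1.0, -2.0]) + 0.05 * np.ones(n)   # sample near span(U1)
label = nsc_binary(x, U1, U2)
print(label)
```

Note that the rule uses only the bases, not the eigenvalues or the noise variance, which is why it remains usable when the covariance cannot be estimated reliably.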


A. Derivation of the Upper Bound 

Starting from the projection onto each subspace, we model
the distribution of these two classes as

$$p(x|C_1) = \int p(x|\alpha, C_1)\,p(\alpha)\,d\alpha = \int \mathcal{N}(x; U_1\alpha, \sigma^2 I)\,p(\alpha)\,d\alpha,$$
$$p(x|C_2) = \int p(x|\alpha, C_2)\,q(\alpha)\,d\alpha = \int \mathcal{N}(x; U_2\alpha, \sigma^2 I)\,q(\alpha)\,d\alpha.$$ (9)

The NSC knows $U_1$ and $U_2$, but is blind to $p(\alpha)$ and $q(\alpha)$,
where $\alpha$ is the expansion of the projection of $x$ in the basis
$U_i$. Note that since we are not assuming a GMM, the vector
$\alpha$ need not be multivariate normal.

Let $V \operatorname{diag}\{\cos\theta_1, \dots, \cos\theta_d\} W^T$ be the singular value
decomposition of $U_1^T U_2$, where $V$, $W$ are unitary, and the
principal angles $\{\theta_1, \dots, \theta_d\}$ are taken in ascending order. We
may absorb $V$, $W$ into $U_1$, $U_2$ at the cost of redefining
$p(\alpha)$, $q(\alpha)$. Thus we may without loss of generality assume
$V = W = I$, i.e.,

$$U_1^T U_2 = \operatorname{diag}\{\cos\theta_1, \dots, \cos\theta_d\} = C.$$ (10)

Define $\Pr(C_2|C_1)$ as the probability of mistaking $C_2$ for $C_1$
and define $\Pr(C_1|C_2)$ similarly. Then the classification error is

$$P_e = \frac{1}{2}\Pr(C_2|C_1) + \frac{1}{2}\Pr(C_1|C_2).$$ (11)

We bound $\Pr(C_2|C_1)$ using principal angles, and $\Pr(C_1|C_2)$ can
be analyzed in the same manner. We expand $\Pr(C_2|C_1)$ using
Bayes rule as

$$\Pr(C_2|C_1) = \int \Pr(C_2|C_1, \alpha)\,p(\alpha)\,d\alpha.$$ (12)

We bound $\Pr(C_2|C_1, \alpha)$ by writing $x = U_1\alpha + n$, where the
noise $n \sim \mathcal{N}(0, \sigma^2 I)$.


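$$\Pr(C_2|C_1, \alpha) = \Pr\left(\|U_1^T(U_1\alpha + n)\|^2 < \|U_2^T(U_1\alpha + n)\|^2\right) = \Pr\left(\|\alpha + U_1^T n\|^2 < \|C\alpha + U_2^T n\|^2\right),$$ (13)

where the probability is taken w.r.t. $n$. Denote the $i$-th column
in $U_1$ ($U_2$) as $u_{1,i}$ ($u_{2,i}$), and the $i$-th element of $\alpha$ as $\alpha_i$. It
follows from Eq. (13) that

$$\Pr\left(\|\alpha + U_1^T n\|^2 < \|C\alpha + U_2^T n\|^2\right) = \Pr\left(\sum_i \left(\alpha_i + u_{1,i}^T n\right)^2 < \sum_i \left(\cos\theta_i\,\alpha_i + u_{2,i}^T n\right)^2\right).$$ (14)

We now define $a_i = \alpha_i + u_{1,i}^T n$ and $b_i = \cos\theta_i\,\alpha_i + u_{2,i}^T n$.
Then Eq. (14) simplifies to

$$\Pr\left(\sum_i a_i^2 < \sum_i b_i^2\right) = \Pr\left(\sum_i (a_i + b_i)(a_i - b_i) < 0\right).$$ (15)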


Lemma 2. Let $a_i$, $b_i$ be defined as above. For any pair of $i, j$
where $i \ne j$:

1) $a_i$ is independent of $a_j$

2) $b_i$ is independent of $b_j$

3) $a_i$ is independent of $b_j$

4) $a_i + b_i$ is independent of $a_i - b_i$


Proof. The proof is given in the appendix. □

It follows from Lemma 2 that $\sum_i (a_i + b_i)(a_i - b_i)$ is the
sum of products of independently distributed normal random
variables. However the product of independently distributed
normal random variables need not be normal, and so we need
to show that $(a_i + b_i)(a_i - b_i)$ is normally distributed.

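Lemma 3 (product of normal random variables [25]). Let $x \sim \mathcal{N}(\mu_x, \sigma_x^2)$
and $y \sim \mathcal{N}(\mu_y, \sigma_y^2)$ be two independent normal
variables. If $\mu_x/\sigma_x \to \infty$ and $\mu_y/\sigma_y \to \infty$ in any manner,
then the distribution of $xy$ approaches normality with mean
$\mu_x\mu_y$ and variance $\mu_x^2\sigma_y^2 + \mu_y^2\sigma_x^2 + \sigma_x^2\sigma_y^2$.

Applying Lemma 3 and combining the independence stated
in Lemma 2, we have


Lemma 4. As $\sigma \to 0$,

$$\sum_i (a_i + b_i)(a_i - b_i) \sim \mathcal{N}\left(\sum_i \sin^2\theta_i\,\alpha_i^2,\; 4\sigma^2 \sum_i \sin^2\theta_i\,(\alpha_i^2 + \sigma^2)\right).$$

Proof. The proof is given in the appendix. □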
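
The mean and variance in Lemma 4 can be traced with a short calculation. The outline below is a sketch, not the appendix proof; it assumes the definitions $a_i = \alpha_i + u_{1,i}^T n$ and $b_i = \cos\theta_i\,\alpha_i + u_{2,i}^T n$ with $n \sim \mathcal{N}(0, \sigma^2 I)$ and $u_{1,i}^T u_{2,i} = \cos\theta_i$, and applies the product-variance formula of Lemma 3 with $x = a_i + b_i$ and $y = a_i - b_i$.

```latex
\begin{align*}
a_i + b_i &\sim \mathcal{N}\big((1+\cos\theta_i)\,\alpha_i,\; 2\sigma^2(1+\cos\theta_i)\big),\\
a_i - b_i &\sim \mathcal{N}\big((1-\cos\theta_i)\,\alpha_i,\; 2\sigma^2(1-\cos\theta_i)\big),\\
\mathbb{E}\big[(a_i+b_i)(a_i-b_i)\big]
  &= (1-\cos^2\theta_i)\,\alpha_i^2 = \sin^2\theta_i\,\alpha_i^2,\\
\operatorname{Var}\big[(a_i+b_i)(a_i-b_i)\big]
  &= \mu_x^2\sigma_y^2 + \mu_y^2\sigma_x^2 + \sigma_x^2\sigma_y^2\\
  &= 4\sigma^2\sin^2\theta_i\,\alpha_i^2 + 4\sigma^4\sin^2\theta_i
   = 4\sigma^2\sin^2\theta_i\,(\alpha_i^2+\sigma^2).
\end{align*}
```

Summing over $i$, and using the independence across indices from Lemma 2, gives the mean $\sum_i \sin^2\theta_i\,\alpha_i^2$ and the variance $4\sigma^2\sum_i \sin^2\theta_i\,(\alpha_i^2 + \sigma^2)$.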

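It follows that $\Pr\left(\sum_i (a_i + b_i)(a_i - b_i) < 0\right)$ is the tail
probability of a normal distribution. Applying the standard
tail bound, we arrive at the following theorem.


Theorem 4. As $\sigma \to 0$, the classification error is upper
bounded as

$$P_e \le \int \mathcal{E}(\theta, \alpha, \sigma^2)\,\left[p(\alpha) + q(\alpha)\right]\,d\alpha,$$

where

$$\mathcal{E}(\theta, \alpha, \sigma^2) = \frac{1}{4}\exp\left\{-\frac{\left(\sum_i \sin^2\theta_i\,\alpha_i^2\right)^2}{8\sigma^2 \sum_i \sin^2\theta_i\,(\alpha_i^2 + \sigma^2)}\right\}.$$

Proof. The proof is given in the appendix. □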


Fig. 2. Lines on which $\mathcal{E}$ is constant for the two case studies introduced in
Section III-D: (a) case 1; (b) case 2.


We return to the two case studies introduced in Section III-D
to provide some intuition about the kernel $\mathcal{E}$. The principal
angles are $[0, \pi/2]$ in Case 1, and $[\pi/4, \pi/4]$ in Case 2. In
Case 1, the kernel is constant on horizontal lines, and in Case
2, it is constant on lines of slope -1. These two cases are shown
in Fig. 2, and we now make a number of general observations.

Remark 4. 1. $\mathcal{E}(\theta, \alpha, \sigma^2)$ is monotonically decreasing w.r.t.
$\sum_i \sin^2\theta_i\,\alpha_i^2$, and monotonically increasing w.r.t. $\sigma^2$. Therefore,
bigger principal angles or signal energy results in smaller
classification error. Bigger noise results in bigger classification
error. 2. Ignoring the higher order term of $\sigma^2$ in the denominator
inside the $\exp(\cdot)$, we have

$$\mathcal{E}(\theta, \alpha, \sigma^2) \approx \frac{1}{4}\exp\left\{-\frac{\sum_i \sin^2\theta_i\,\alpha_i^2}{8\sigma^2}\right\},$$

which clearly indicates that classification performance is a
function of discernibility (the sines of the principal angles) weighted
by signal energy (the $\alpha_i^2$'s). 3. For fixed energy, classification
error is decreased by allocating larger $\alpha_i^2$ to larger $\theta_i$.
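
The third observation can be illustrated by evaluating the kernel for two allocations of the same total energy. The sketch assumes the kernel has the form $\frac{1}{4}\exp\{-(\sum_i \sin^2\theta_i\alpha_i^2)^2 / (8\sigma^2\sum_i \sin^2\theta_i(\alpha_i^2+\sigma^2))\}$; the prefactor $\frac{1}{4}$ and the function name are assumptions, and the comparison does not depend on the prefactor. At least one principal angle must be nonzero for the expression to be defined.

```python
import numpy as np

def kernel(theta, alpha, sigma2):
    """Kernel E(theta, alpha, sigma^2) in the form assumed from Theorem 4."""
    s2 = np.sin(theta) ** 2
    signal = np.sum(s2 * alpha ** 2)              # discernibility weighted by energy
    spread = np.sum(s2 * (alpha ** 2 + sigma2))   # keeps the higher-order sigma^2 term
    return 0.25 * np.exp(-signal ** 2 / (8.0 * sigma2 * spread))

theta = np.array([np.pi / 6, np.pi / 3])
sigma2 = 0.05
# Same total energy ||alpha||^2 = 1, split differently across the two modes.
more_on_large_angle = kernel(theta, np.sqrt(np.array([0.2, 0.8])), sigma2)
more_on_small_angle = kernel(theta, np.sqrt(np.array([0.8, 0.2])), sigma2)
print(more_on_large_angle, more_on_small_angle)
```

Placing most of the energy on the mode with the larger principal angle yields the smaller kernel value, in line with the remark.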


B. Numerical Analysis of Synthetic Data 

We now examine the agreement between empirical error
and the upper bound given in Theorem 4. Set $n = 6$, $d = 2$,

$$U_1 = [I_2, 0_{2 \times 4}]^T, \qquad U_2 = \begin{bmatrix} \cos\theta & 0 & 0 & 0 & \sin\theta & 0 \\ 0 & \cos\theta & 0 & 0 & 0 & \sin\theta \end{bmatrix}^T,$$

so that the two principal angles between $U_1$ and $U_2$ are $\theta_1 = \theta_2 = \theta$.
Set $p(\alpha) = q(\alpha) = \mathcal{N}(\alpha; 0, I_2)$, and vary $\sigma^2$ in
$[0.01, 0.5]$. Fig. 3a considers three values of $\theta$ ($\pi/6$, $\pi/4$, and
$\pi/3$), and shows that empirical NSC classification error tracks
the upper bound obtained by numerical integration.

Fig. 3. Comparison of empirical NSC classification error with the upper
bound obtained by numerical integration. (a) Larger principal angles reduce
classification error; (b) Disproportionate assignment of signal energy to larger
principal angles reduces classification error.

Next we examine the dependence of classification error on
the distribution of signal energy across the two modes. Set $n = 6$,
$d = 2$, $U_1 = [I_2, 0_{2 \times 4}]^T$ and

$$U_2 = \begin{bmatrix} \cos(\pi/6) & 0 & 0 & 0 & \sin(\pi/6) & 0 \\ 0 & \sin(\pi/6) & 0 & 0 & 0 & \cos(\pi/6) \end{bmatrix}^T,$$

so that the two principal angles are $\theta_1 = \pi/6$ and $\theta_2 = \pi/3$.
Fix $\|\alpha\|^2 = 1$, and compare the case when $\alpha$ is distributed
such that $|\alpha_1| < |\alpha_2|$ (Case 3 in Fig. 3b), with the case when
$\alpha$ is distributed such that $|\alpha_1| > |\alpha_2|$ (Case 4 in Fig. 3b).
Empirical error is calculated for a range of noise variances,
by randomly drawing 10,000 samples per class. Empirical NSC
classification error tracks the upper bound given by numerical
integration, with performance of Case 3 superior to that of
Case 4.


V. TRAIT: TUNABLE RECOGNITION ADAPTED TO
INTRA-CLASS TARGETS

In the previous theorems, it is the principal angles that
determine the performance of the classifiers in different SNR
regimes. This suggests that we might improve classification
by applying a linear transformation that optimizes principal
angles, even at the cost of reducing dimensionality.

We denote the collection of all labeled training samples as
$X = [X_1, \dots, X_K] \in \mathbb{R}^{n \times N}$, where columns in the submatrix
$X_k \in \mathbb{R}^{n \times N_k}$ are samples from the $k$-th class and $N = \sum_k N_k$. The signal
subspace of $X_k$ is spanned by the orthonormal basis $U_k$ defined
above. The linear transform $A \in \mathbb{R}^{m \times n}$ ($m \le n$) is designed
to maximize separation of the subspaces $AU_1, \dots, AU_K$. The
maximal separation is achieved when $(AU_j)^T(AU_k) = 0$
for all $j \ne k$. In this case, all the principal angles are
$\pi/2$. One approach is to use the SVD to compute the $U_k$
and then to learn the linear transformation $A$. However we
may avoid pre-computing the $U_k$ by simply encouraging
$(AX_j)^T(AX_k) = 0$ for all $j \ne k$.

We shall require that the transform $A$ preserve some specific
characteristic or trait of each individual class. For example,
we may target $(AX_k)^T(AX_k) = X_k^T X_k$ for all $k$, so that
the original intra-class data structure (with noise) is preserved.
Given access to a denoised signal, $\hat{X}_k$, we might instead target
$(AX_k)^T(AX_k) = \hat{X}_k^T \hat{X}_k$, again for all $k$. In this case, the
intra-class dispersion due to noise is suppressed. Thus, the
Gram matrix $T$ of the transformed signal can be designed
to target preservation of particular intra-class structure. We
formulate the optimization problem as

$$\min_A \left\|(AX)^T(AX) - T\right\|_F^2.$$ (16)

The block diagonal structure of the target Gram matrix $T$
promotes larger principal angles between subspaces. At the
same time the diagonal blocks can be tuned to different
characteristics of individual classes. For example, when side
information is available, we may consider incorporating it in
the diagonal blocks. Here we only consider

$$T = \operatorname{diag}\{X_1^T X_1, \dots, X_K^T X_K\},$$ (17)

as a proof-of-concept. We refer to this approach as the TRAIT
algorithm, where the acronym denotes Tunable Recognition
Adapted to Intra-class Targets.

It is possible to minimize the objective in Eq. (16) by first
minimizing $\|X^T P X - T\|_F^2$ for $P \succeq 0$ (as in Proposition 1), and
then factoring $P$ as $P = A^T A$ where $A \in \mathbb{R}^{m \times n}$.

Proposition 1. The minimizer of $\|X^T P X - T\|_F^2$, where $P \succeq 0$,
is $P^* = (XX^T)^{-1} X T X^T (XX^T)^{-1}$.

Proof. The proof is detailed in the appendix. □
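
Proposition 1 can be sanity-checked numerically. The sketch below assumes the closed form $P^* = (XX^T)^{-1} X T X^T (XX^T)^{-1}$ and the block-diagonal target $T = \operatorname{diag}\{X_k^T X_k\}$; the random data, the sizes, and the variable names are hypothetical, and $X$ is assumed to have full row rank so that $XX^T$ is invertible.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sizes = 4, (5, 6)                     # ambient dimension, samples per class
blocks = [rng.standard_normal((n, Nk)) for Nk in sizes]
X = np.hstack(blocks)

# Block-diagonal target Gram matrix T = diag{X_1^T X_1, X_2^T X_2}.
N = X.shape[1]
T = np.zeros((N, N))
start = 0
for Xk in blocks:
    stop = start + Xk.shape[1]
    T[start:stop, start:stop] = Xk.T @ Xk
    start = stop

def objective(P):
    """Squared Frobenius distance between X^T P X and the target T."""
    return np.linalg.norm(X.T @ P @ X - T, "fro") ** 2

G_inv = np.linalg.inv(X @ X.T)           # requires X to have full row rank
P_star = G_inv @ X @ T @ X.T @ G_inv

# Stationarity of the least-squares problem: X (X^T P X - T) X^T = 0.
residual = X @ (X.T @ P_star @ X - T) @ X.T
min_eig = np.linalg.eigvalsh(P_star).min()
print(np.linalg.norm(residual), min_eig)
```

The residual of the normal equations vanishes up to round-off, and $P^*$ is positive semidefinite because $T$ is, so the constraint $P \succeq 0$ is inactive in this example.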

However when $m < n$, such a rank-$m$ decomposition may
not exist since this $P$ is not guaranteed to be rank deficient.
An alternative is to learn a rank deficient $P$ by solving

$$\min_P \left\|X^T P X - T\right\|_F^2 + \lambda\|P\|_*,$$

where the nuclear norm $\|P\|_*$ regularizes the rank of $P$.
However this approach requires careful tuning of $\lambda$, and it is
computationally more complex since we work with a matrix
$P$ larger than $A$. Given these considerations, we choose to
solve using gradient descent as described in Algorithm 1.

Algorithm 1 TRAIT for feature extraction
Input: labeled training samples $X = [X_1, \dots, X_K]$, target
dimension $m$ ($m \le n$), target Gram matrix $T$.

Output: feature extraction matrix (transform) $A \in \mathbb{R}^{m \times n}$.
1: Initialize $A = [e_1, \dots, e_m]^T$, where $e_i$ is the $i$-th standard
basis vector.

2: while stopping criteria not met do
3: Compute gradient

$G = A(XX^T A^T A XX^T - XTX^T)$.

4: Choose a positive step-size $\eta$ and take a gradient step

$A \leftarrow A - \eta G$.

5: end while
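
Algorithm 1 can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the objective is taken to be $\|(AX)^T(AX) - T\|_F^2$ (whose gradient is $4G$, so $G$ is a valid descent direction), the stopping rule is a fixed iteration budget, and the backtracking step-size rule is an added safeguard that the algorithm listing does not specify. The function name `trait` and the synthetic data are hypothetical.

```python
import numpy as np

def trait(X, T, m, iters=300, eta=1.0):
    """Gradient descent on ||(AX)^T (AX) - T||_F^2 over A of size m x n."""
    n = X.shape[0]
    A = np.eye(n)[:m, :]                        # A = [e_1, ..., e_m]^T
    XXt, XTXt = X @ X.T, X @ T @ X.T

    def objective(A):
        return np.linalg.norm((A @ X).T @ (A @ X) - T, "fro") ** 2

    history = [objective(A)]
    for _ in range(iters):
        G = A @ (XXt @ A.T @ A @ XXt - XTXt)    # direction from step 3
        # Backtracking (added safeguard): halve eta until the objective drops.
        while eta > 1e-16 and not objective(A - eta * G) < history[-1]:
            eta *= 0.5
        if eta <= 1e-16:
            break                               # no descent step found
        A = A - eta * G
        history.append(objective(A))
        eta *= 2.0                              # allow the step to grow again
    return A, history

# Hypothetical data: K = 2 classes in R^6, five samples each, target dimension 3.
rng = np.random.default_rng(1)
blocks = [rng.standard_normal((6, 5)) for _ in range(2)]
X = np.hstack(blocks)
T = np.zeros((10, 10))
T[:5, :5] = blocks[0].T @ blocks[0]
T[5:, 5:] = blocks[1].T @ blocks[1]
A, history = trait(X, T, m=3)
print(history[0], history[-1])
```

With backtracking, every accepted step lowers the objective, so the recorded history is strictly decreasing; a fixed small step size would also work but needs tuning to the scale of $XX^T$.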


A. Related Methods 

Linear Discriminant Analysis (LDA) is a classical feature
extraction method which assumes each class to be Gaussian
distributed. It achieves better performance on face recognition
tasks than does PCA [?]. LDA does not assume near low-rank
structure of the covariances, and therefore considers a
different data geometry than the one studied here.

Methods of feature extraction based on random projection
have recently been developed and successfully applied to face
recognition [?]. Random projection is designed to preserve
pairwise distances between all data points uniformly across
class labels [?].

More recently, the Low-Rank Transform (LRT) has been
proposed as a method of extracting features [?]. It enlarges
inter-class distance while suppressing intra-class dispersion.
LRT uses the nuclear norm, $\|AX_i\|_*$, to measure the dispersion
of the (transformed) data. The transform $A$ is

$$\arg\min_{A \in \mathbb{R}^{m \times n}:\ \|A\|_2 \le c}\ \sum_{i=1}^{K} \|AX_i\|_* - \|AX\|_*.$$

What motivates the choice of the nuclear norm is that it is
the convex relaxation of rank [8]. In the high SNR regime,
Theorem 1 suggests that classification error decreases when
the union of subspaces has large rank. LRT encourages the
rank of the union to be large, and it works well in a regime
where model mismatch is small. Experiments presented in
Section V-C suggest that TRAIT may be more robust to model
mismatch (Fig. [?]).


B. Two Properties of the TRAIT Transform 

On synthetic and measured data, we show that TRAIT
effectively enlarges the angles between different subspaces
and preserves intra-class structure. We also compare the
classification accuracy of features extracted by TRAIT and
the methods in Section V-A. For synthetic data, the class
distribution is known exactly, and the MAP classifier is used to
measure classification accuracy. For measured data, the class
distribution is unknown a priori, and the NSC classifier is
employed instead.

1) Enlargement of the Principal Angles: The synthetic
dataset has parameters $n = 10$, $d = 1$ and $K = 3$, with

$$\Sigma_k = U_k U_k^T + 10^{-?}\,I \quad (k = 1, 2, 3),$$

where $U_k$ is a normalized $n$-vector with i.i.d. Gaussian
random entries. Samples of the $k$-th class are drawn i.i.d. from
$\mathcal{N}(0, \Sigma_k)$. For each class, 100 samples are used for learning
the transform and 10000 are used for testing. On the training
data, we learn the transform via LDA, LRT, and
TRAIT respectively, with target dimension $m = 3, \dots, 10$. Then on each
test datum, we apply the learned transforms as well as random
projection (each entry drawn from $\mathcal{N}(0, 1)$) and classify using
a MAP classifier.

We visualize original and transformed data via projection
(PCA basis) into 3-dimensional Euclidean space. When the
target feature dimension $m = 3$, the results are shown in
Fig. 4. Each class is represented by a different color. After
transforming the data, we use the SVD to calculate the basis
vector ($d = 1$) that best describes each class, and we calculate
the pairwise angles between basis vectors. The pairwise angles
are significantly increased by both LRT and TRAIT. By contrast,
neither LDA nor random projection increases separation
between one-dimensional subspaces.



Fig. 4. Embeddings of original and transformed data. Pairwise angles:
(a) Original: 66.1°, 21.8°, 70.0°; (b) TRAIT: 87.8°, 72.9°, 88.7°;
(c) LRT: 86.0°, 77.1°, 83.0°; (d) LDA: 47.0°, 86.4°, 87.8°;
(e) Random: 72.4°, 2.5°, 73.0°.


We now vary the feature dimension m, and compare the er¬ 
ror probability of the MAP classiher across the different meth¬ 
ods of extracting features. Fig.j^shows that the performance of 
TRAIT and LRT are similar, and that both are superior to LDA 
and random projection. Note that after dimension reduction. 
TRAIT is still able to match error probabilities achieved with 
the original data. 

2) Preservation of Intra-class Structure: When a convex body, e.g., a human face, is illuminated, the resulting image is represented by spherical harmonics. It has been shown that a 9-dimensional subspace is sufficient to capture the geometry of an individual subject [1]. The extended Yale B face database includes 38 subjects, each with 64 images taken under different illumination conditions. We use a cropped version of this data set¹ where each image is of size 32 × 32 = 1024 pixels.

For each subject, we randomly select half of the 64 images for training, and retain the other half for testing. For all feature extraction methods, we vary the target dimension m, and apply the NSC to the transformed data. The NSC achieves much higher accuracy on features extracted by TRAIT and LRT (Fig. 6).
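For readers unfamiliar with the classifier, a nearest subspace classifier can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' implementation: each class is fit with an uncentered `dim`-dimensional subspace by SVD, a test point is assigned to the class with the smallest projection residual, and the toy data and function names are hypothetical.

```python
import numpy as np

def fit_class_subspaces(train_x, train_y, dim):
    """Return {label: n x dim orthonormal basis} fitted by SVD (no centering)."""
    bases = {}
    for label in np.unique(train_y):
        rows = train_x[train_y == label]              # samples are rows
        _, _, vt = np.linalg.svd(rows, full_matrices=False)
        bases[int(label)] = vt[:dim].T
    return bases

def nsc_predict(bases, test_x):
    """Assign each row of test_x to the class with the smallest residual."""
    labels = sorted(bases)
    energy = np.sum(test_x ** 2, axis=1)
    # Residual energy = total energy minus energy captured by the subspace.
    residuals = np.stack(
        [energy - np.sum((test_x @ bases[k]) ** 2, axis=1) for k in labels],
        axis=1,
    )
    return np.asarray(labels)[np.argmin(residuals, axis=1)]

# Toy data: two random 3-dimensional subspaces of R^20 plus small noise.
rng = np.random.default_rng(0)
n, dim, per_class = 20, 3, 200
true_bases = [np.linalg.qr(rng.standard_normal((n, dim)))[0] for _ in range(2)]

def draw(k, count):
    coeffs = rng.standard_normal((count, dim))
    return coeffs @ true_bases[k].T + 0.05 * rng.standard_normal((count, n))

train_x = np.vstack([draw(0, per_class), draw(1, per_class)])
train_y = np.repeat([0, 1], per_class)
test_x = np.vstack([draw(0, per_class), draw(1, per_class)])
test_y = np.repeat([0, 1], per_class)

bases = fit_class_subspaces(train_x, train_y, dim)
accuracy = float(np.mean(nsc_predict(bases, test_x) == test_y))
print(f"toy NSC accuracy: {accuracy:.3f}")
```

In the experiments, the inputs to such a classifier would be the features produced by each transform rather than raw toy vectors.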



Fig. 5. MAP classifier's $P_e$ on transformed data. Note that TRAIT (blue) and LRT (red) almost overlap.

Fig. 6. NSC's $P_e$ on original/transformed face images. Concatenation of TRAIT and LRT features (TRAIT+LRT) provides superior results.

¹ http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html


































We also observe in Fig. 7 that the features extracted by TRAIT and LRT are quite different, suggesting that information present in one view is somewhat independent of information present in the other. This is confirmed by applying NSC to the concatenation of the two views (TRAIT+LRT in Fig. 6), and observing that classification accuracy is increased.

The intra-class structure preserving property of TRAIT is evident in Fig. 7, where we view transformed classes as faces in the original image domain. The original images of subject 10 are displayed together with their TRAIT and LRT transforms. TRAIT preserves a diversity of illumination conditions, whereas LRT blurs the differences between images. Classification performance is improved by using LRT and TRAIT features in combination.



Fig. 7. Comparison of original images (top) with TRAIT transformed images 
(middle) and LRT transformed images (bottom). Red circles indicate structure 
that is present in both the original and the TRAIT transformed image. 


C. Robustness to Model Mismatch 

In the previous sections, we have demonstrated the effectiveness of TRAIT and LRT on both synthetic and real data. In this section, we present experiments showing that TRAIT is more robust to model mismatch than LRT. In many real-world problems, data may not be exactly GMM distributed. Even if they are, there may not be sufficient training data to learn the covariances. Therefore, we use NSC throughout this section to assess the discriminability of the extracted features. Moreover, having seen the effectiveness of dimension reduction in previous sections, we turn to learning dimension-reduced features, thereby saving computational cost on measured datasets.

1) Synthetic Data: The synthetic data is a three-class dataset, where a datum $x \in \mathbb{R}^{100}$ in the k-th (k = 1, 2, 3) class is generated as

$$x = U_k \alpha + n,$$

with $U_k \in \mathbb{R}^{100 \times 5}$, $U_k^{\top} U_k = I$, $\alpha \sim \mathrm{Uniform}[-2, 2]$ and $n \sim \mathcal{N}(0, \sigma^2 I_{100})$. Note that the data is not GMM distributed. Each class has 100 training samples and 10000 testing samples. We vary $\sigma^2$ and use NSC to classify TRAIT and LRT extracted features. Here we fix the extracted feature dimension to be 30.

Fig. 8 shows the NSC classification accuracy as a function of $\sigma$. Both TRAIT and LRT significantly improve classification performance compared with no transform. However, with increasing noise, TRAIT features outperform LRT features, showing greater robustness to model mismatch.
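The generative model $x = U_k \alpha + n$ used in this experiment can be sketched as follows. Several details are assumptions made here for illustration: the orthonormal basis is drawn by QR factorization of a Gaussian matrix, the entries of $\alpha$ are drawn independently from Uniform[-2, 2], and the value `sigma2 = 0.1` is arbitrary.

```python
import numpy as np

def make_class(rng, num, n=100, d=5, sigma2=0.1):
    """Draw `num` samples x = U alpha + noise from one class (rows are samples)."""
    # Orthonormal basis via QR; how U_k was drawn is not stated, so assumed.
    basis, _ = np.linalg.qr(rng.standard_normal((n, d)))
    # Non-Gaussian coefficients: entrywise uniform on [-2, 2] (assumed i.i.d.).
    alpha = rng.uniform(-2.0, 2.0, size=(num, d))
    noise = np.sqrt(sigma2) * rng.standard_normal((num, n))
    return basis, alpha @ basis.T + noise

rng = np.random.default_rng(0)
classes = [make_class(rng, num=100) for _ in range(3)]

for k, (basis, x) in enumerate(classes, start=1):
    # Energy outside the class subspace comes from noise only.
    off = x - x @ basis @ basis.T
    print(k, x.shape, round(float(np.mean(np.sum(off ** 2, axis=1))), 2))
```

Because the coefficients are uniform rather than Gaussian, each class is not Gaussian distributed, which is the model mismatch this experiment probes.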



Fig. 8. NSC performance on TRAIT and LRT features under different SNR.

Fig. 9. From top to bottom row: subjects in the PIE, UMIST and ORL databases, taken under different poses.



2) Face Images with Non-frontal Poses: It is known that human frontal face images are well modeled by subspaces. One example is the Yale-B face data in Section V-B2, where LRT slightly outperforms TRAIT. We now further compare the performance of TRAIT and LRT in more mismatched cases by introducing non-frontal face images. We validate performance on three publicly available datasets: PIE, UMIST² and ORL³. All of them have a considerable number of non-frontal face images. Fig. 9 shows one subject from each database with different poses.

The PIE dataset includes 18562 images of size 64 × 48 from 68 subjects. Each image is labeled with one of 13 different pose tags. We randomly select 7 pose tags, and the images with these tags are used as training samples. The rest are used in testing. UMIST comprises 575 images of size 112 × 92 from 20 subjects, and ORL comprises 400 images of size 112 × 92 from 40 subjects. These two datasets have no pose tags. We split the UMIST and ORL datasets using the strategy followed for the Yale-B dataset in Section V-B2. We derive 1000-dimensional features for each of random projection, LDA, LRT and TRAIT. Table I lists the accuracies of NSC classification for the different algorithms.

TABLE I
NSC ACCURACY ON ORIGINAL AND 1000-DIMENSIONAL (COMPRESSED) EXTRACTED FEATURES

Method      PIE       UMIST     ORL
Original    74.57%    96.14%    95.50%
Random      72.14%    95.44%    94.50%
LDA         40.10%    84.91%    92.00%
LRT         70.80%    96.84%    95.00%
TRAIT       76.11%    97.90%    97.00%

In all cases, TRAIT has the highest classification accuracy and outperforms LRT. LRT optimizes the rank (through its convex relaxation), which is critical for reducing classification error in the high SNR regime. However, in this low SNR regime, TRAIT gains more discrimination by explicitly "orthogonalizing" the classes. The criteria employed by TRAIT do not depend on the specific SNR regime and are therefore more robust.


² http://www.sheffield.ac.uk/eee/research/iel/research/face

³ http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html


VI. Conclusion

In a low-rank Gaussian Mixture Model, we have explored how the probability of misclassification is governed by principal angles between subspaces. In the low-noise regime, the Bhattacharyya upper bound on misclassification is determined by the product of the sines of the principal angles. In the high/moderate-noise regime it is determined by the sum of the squares of the sines of the principal angles. Analysis of the Nearest Subspace Classifier connected reliability of classification to the distribution of signal energy across principal vectors. Classification was shown to be more reliable when more signal energy is associated with principal vectors corresponding to large principal angles. This observation motivated the design of a transform, TRAIT, that achieves superior classification performance by enlarging principal angles and preserving intra-class structure. Finally, we showed that TRAIT complements a prior approach that enlarges inter-class distance while suppressing intra-class dispersion, and that it is more robust to model mismatch.


Appendix A 

Proof of high SNR case 

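Proof of Theorem 1. We have

$$\det \Sigma_1 = (\sigma^2)^{n-d} \prod_{i=1}^{d} (\lambda_{1,i} + \sigma^2), \qquad \det \Sigma_2 = (\sigma^2)^{n-d} \prod_{i=1}^{d} (\lambda_{2,i} + \sigma^2).$$

Let the SVD of $U_{1,\cap}\Lambda_{1,\cap}U_{1,\cap}^{\top} + U_{1,\setminus}\Lambda_{1,\setminus}U_{1,\setminus}^{\top} + U_{2,\cap}\Lambda_{2,\cap}U_{2,\cap}^{\top} + U_{2,\setminus}\Lambda_{2,\setminus}U_{2,\setminus}^{\top}$ be $Z \Lambda Z^{\top}$, where $\Lambda = \mathrm{diag}\{\lambda_1, \ldots, \lambda_{2d-r}\}$. Then,

$$\det\left(\frac{\Sigma_1 + \Sigma_2}{2}\right) = (\sigma^2)^{n-2d+r} \prod_{i=1}^{2d-r} \left(\frac{\lambda_i}{2} + \sigma^2\right).$$

Substituting the above into the Bhattacharyya bound, we have

$$P_e \le \frac{1}{2} \sqrt{\frac{(\sigma^2)^{d-r} \sqrt{\prod_{i=1}^{d} (\lambda_{1,i} + \sigma^2) \prod_{i=1}^{d} (\lambda_{2,i} + \sigma^2)}}{\prod_{i=1}^{2d-r} \left(\frac{\lambda_i}{2} + \sigma^2\right)}}.$$

As $\sigma^2 \to 0$, the right-hand side is asymptotically equal to

$$(\sigma^2)^{\frac{d-r}{2}} \, 2^{\frac{2d-r}{2} - 1} \, \frac{\left(\prod_{i=1}^{d} \lambda_{1,i} \prod_{i=1}^{d} \lambda_{2,i}\right)^{1/4}}{\left(\prod_{i=1}^{2d-r} \lambda_i\right)^{1/2}}. \tag{18}$$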

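Our objective is to expand $\prod_{i=1}^{2d-r} \lambda_i$ in terms of principal angles. Since the image of $U_{1,\cap}$ (or $U_{2,\cap}$) is orthogonal to $U_{1,\setminus}$ and $U_{2,\setminus}$,

$$
\begin{aligned}
\prod_{i=1}^{2d-r} \lambda_i
&= \mathrm{pdet}\left(U_{1,\cap}\Lambda_{1,\cap}U_{1,\cap}^{\top} + U_{2,\cap}\Lambda_{2,\cap}U_{2,\cap}^{\top}\right) \\
&\quad \cdot \mathrm{pdet}\left(\left[U_{1,\setminus}\Lambda_{1,\setminus}^{1/2} \;\; U_{2,\setminus}\Lambda_{2,\setminus}^{1/2}\right] \left[U_{1,\setminus}\Lambda_{1,\setminus}^{1/2} \;\; U_{2,\setminus}\Lambda_{2,\setminus}^{1/2}\right]^{\top}\right) \\
&= \mathrm{pdet}\left(U_{1,\cap}\Lambda_{1,\cap}U_{1,\cap}^{\top} + U_{2,\cap}\Lambda_{2,\cap}U_{2,\cap}^{\top}\right) \\
&\quad \cdot \det\left(\left[U_{1,\setminus}\Lambda_{1,\setminus}^{1/2} \;\; U_{2,\setminus}\Lambda_{2,\setminus}^{1/2}\right]^{\top} \left[U_{1,\setminus}\Lambda_{1,\setminus}^{1/2} \;\; U_{2,\setminus}\Lambda_{2,\setminus}^{1/2}\right]\right),
\end{aligned}
$$

where we assume $n > 2(d - r)$ in order to derive the second equality, which simplifies as follows:

$$
\begin{aligned}
&\det\left(\left[U_{1,\setminus}\Lambda_{1,\setminus}^{1/2} \;\; U_{2,\setminus}\Lambda_{2,\setminus}^{1/2}\right]^{\top} \left[U_{1,\setminus}\Lambda_{1,\setminus}^{1/2} \;\; U_{2,\setminus}\Lambda_{2,\setminus}^{1/2}\right]\right) \\
&\quad = \det \begin{bmatrix} \Lambda_{1,\setminus} & \Lambda_{1,\setminus}^{1/2} U_{1,\setminus}^{\top} U_{2,\setminus} \Lambda_{2,\setminus}^{1/2} \\ \Lambda_{2,\setminus}^{1/2} U_{2,\setminus}^{\top} U_{1,\setminus} \Lambda_{1,\setminus}^{1/2} & \Lambda_{2,\setminus} \end{bmatrix} \\
&\quad = \det(\Lambda_{1,\setminus}) \det\left(\Lambda_{2,\setminus} - \Lambda_{2,\setminus}^{1/2} U_{2,\setminus}^{\top} U_{1,\setminus} U_{1,\setminus}^{\top} U_{2,\setminus} \Lambda_{2,\setminus}^{1/2}\right) \\
&\quad = \det(\Lambda_{1,\setminus}) \det\left(\Lambda_{2,\setminus}^{1/2} \left(I - U_{2,\setminus}^{\top} U_{1,\setminus} U_{1,\setminus}^{\top} U_{2,\setminus}\right) \Lambda_{2,\setminus}^{1/2}\right) \\
&\quad = \prod_{i=1}^{d-r} \lambda_{1\setminus,i} \cdot \prod_{i=1}^{d-r} \lambda_{2\setminus,i} \cdot \prod_{i=r+1}^{d} \sin^2 \theta_i.
\end{aligned}
\tag{19}
$$

The last equality follows from the observation that the eigenvalues of $U_{2,\setminus}^{\top} U_{1,\setminus} U_{1,\setminus}^{\top} U_{2,\setminus}$ are $\cos^2 \theta_{r+1}, \ldots, \cos^2 \theta_d$. The theorem now follows by substituting Eq. (19) into Eq. (18).

□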
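The determinant identity in Eq. (19) can be checked numerically. The sketch below is only a sanity check under simplifying assumptions: the two subspaces are drawn at random so that they share no common directions (r = 0), `k` plays the role of d − r, and the dimensions and eigenvalue ranges are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 12, 3  # ambient dimension and number of non-shared directions (d - r)

u1, _ = np.linalg.qr(rng.standard_normal((n, k)))
u2, _ = np.linalg.qr(rng.standard_normal((n, k)))
lam1 = rng.uniform(0.5, 2.0, k)
lam2 = rng.uniform(0.5, 2.0, k)

# Stack the scaled bases: a = [U1 Lambda1^(1/2), U2 Lambda2^(1/2)], size n x 2k.
a = np.hstack([u1 * np.sqrt(lam1), u2 * np.sqrt(lam2)])

# Left-hand side of Eq. (19): determinant of the 2k x 2k Gram matrix.
lhs = np.linalg.det(a.T @ a)

# Pseudo-determinant of a a^T: product of its 2k nonzero eigenvalues.
eigs = np.linalg.eigvalsh(a @ a.T)
pdet = float(np.prod(eigs[-2 * k:]))

# Right-hand side: eigenvalue products times the product of squared sines.
cosines = np.linalg.svd(u1.T @ u2, compute_uv=False)
rhs = lam1.prod() * lam2.prod() * np.prod(1.0 - cosines ** 2)

print(lhs, pdet, rhs)
```

The three printed numbers should agree up to floating-point error, reflecting both the pdet-to-det step (which needs n > 2k) and the Schur complement step.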


Appendix B 

Proof of Low SNR case 

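We first state and prove (for completeness) two preliminary lemmas that are needed to characterize classification error.

Lemma 5. Let $D \in \mathbb{R}^{n \times n}$ be any positive semi-definite matrix with all eigenvalues smaller than 1. Then

$$\mathrm{tr}(D) - \frac{1}{2} \mathrm{tr}(D^2) \le \ln\det(I_n + D) \le \mathrm{tr}(D) - \frac{1}{4} \mathrm{tr}(D^2).$$

Proof. Denote the nonnegative eigenvalues of $D \succeq 0$ as $d_1, \ldots, d_n$, where $d_1, \ldots, d_n < 1$. Then

$$\ln\det(I_n + D) = \ln \prod_{i=1}^{n} (1 + d_i) = \sum_{i=1}^{n} \ln(1 + d_i).$$

Since $x - \frac{x^2}{2} \le \ln(1 + x) \le x - \frac{x^2}{4}$ for all $x \in [0, 1]$, we obtain

$$\sum_i \left(d_i - \frac{d_i^2}{2}\right) \le \ln\det(I_n + D) \le \sum_i \left(d_i - \frac{d_i^2}{4}\right),$$

which reduces to

$$\mathrm{tr}(D) - \frac{1}{2} \mathrm{tr}(D^2) \le \ln\det(I_n + D) \le \mathrm{tr}(D) - \frac{1}{4} \mathrm{tr}(D^2).$$

This bound is very tight when all the $d_i$'s approach 0. □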
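A quick numerical check of Lemma 5 is sketched below. The matrix size, the seed and the choice of 0.9 for the largest eigenvalue are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8

# Random positive semi-definite matrix, rescaled so its largest eigenvalue is 0.9.
b = rng.standard_normal((n, n))
d = b @ b.T
d *= 0.9 / np.linalg.eigvalsh(d)[-1]

# slogdet returns (sign, log|det|); the matrix I + D is positive definite.
logdet = float(np.linalg.slogdet(np.eye(n) + d)[1])
lower = float(np.trace(d) - 0.5 * np.trace(d @ d))
upper = float(np.trace(d) - 0.25 * np.trace(d @ d))

print(lower, logdet, upper)
```

Shrinking the scale factor toward 0 makes the gap between `lower` and `upper` vanish, in line with the remark on tightness.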

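Lemma 6. Suppose $U \in \mathbb{R}^{n \times d}$, $V \in \mathbb{R}^{n \times d}$ are two orthonormal bases and that $\Phi \in \mathbb{R}^{d \times d}$, $\Psi \in \mathbb{R}^{d \times d}$ are diagonal with nonnegative decreasing diagonal elements $\phi_1, \ldots, \phi_d$ and $\psi_1, \ldots, \psi_d$ respectively. Denote the i-th principal angle between $U$ and $V$ as $\theta_i$, where $i = 1, \ldots, d$. Then

$$\phi_d \psi_d \sum_i \cos^2 \theta_i \le \mathrm{tr}(U \Phi U^{\top} V \Psi V^{\top}) \le \phi_1 \psi_1 \sum_i \cos^2 \theta_i.$$

Proof. Let the Singular Value Decomposition of $U^{\top} V$ be $J C H^{\top}$; then $\mathrm{tr}(U^{\top} V V^{\top} U) = \mathrm{tr}(C^2) = \sum_i \cos^2 \theta_i$. We have

$$\mathrm{tr}(U \Phi U^{\top} V \Psi V^{\top}) = \mathrm{tr}(\Phi U^{\top} V \Psi V^{\top} U) = \mathrm{tr}(\Phi J C H^{\top} \Psi H C J^{\top}) = \mathrm{tr}(J^{\top} \Phi J C H^{\top} \Psi H C).$$

For any two positive semidefinite matrices $A, B \in \mathbb{R}^{m \times m}$, let the maximum and minimum eigenvalues of $A$ be $\lambda_1(A)$ and $\lambda_m(A)$ respectively. Then

$$\lambda_m(A) \, \mathrm{tr}(B) \le \mathrm{tr}(AB) \le \lambda_1(A) \, \mathrm{tr}(B).$$

Hence,

$$\mathrm{tr}(U \Phi U^{\top} V \Psi V^{\top}) \le \phi_1 \, \mathrm{tr}(C H^{\top} \Psi H C) = \phi_1 \, \mathrm{tr}(H^{\top} \Psi H C^2) \le \phi_1 \psi_1 \, \mathrm{tr}(C^2) = \phi_1 \psi_1 \sum_i \cos^2 \theta_i.$$

The lower bound can be proved in the same way. This bound becomes tight when the diagonal elements of $\Phi$ and $\Psi$ are uniform. □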
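The trace bounds of Lemma 6 can likewise be checked on random instances. In this sketch the dimensions, seed and the ranges of the diagonal entries are arbitrary assumptions; the cosines of the principal angles are obtained as the singular values of $U^{\top} V$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 10, 4

u, _ = np.linalg.qr(rng.standard_normal((n, d)))
v, _ = np.linalg.qr(rng.standard_normal((n, d)))

# Nonnegative diagonal entries sorted in decreasing order.
phi = np.sort(rng.uniform(0.1, 3.0, d))[::-1]
psi = np.sort(rng.uniform(0.1, 3.0, d))[::-1]

middle = float(np.trace(u @ np.diag(phi) @ u.T @ v @ np.diag(psi) @ v.T))

# Sum of squared cosines of the principal angles between span(U) and span(V).
cos_sq_sum = float(np.sum(np.linalg.svd(u.T @ v, compute_uv=False) ** 2))

lower = phi[-1] * psi[-1] * cos_sq_sum
upper = phi[0] * psi[0] * cos_sq_sum
print(lower, middle, upper)
```

Setting `phi` and `psi` to constant vectors makes `lower`, `middle` and `upper` coincide, which is the tight case mentioned in the lemma.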

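Proof of the theorem (low SNR case). We expand $K$ in the Bhattacharyya bound as

$$K = \frac{1}{2} \ln\det\left(\frac{\Sigma_1 + \Sigma_2}{2}\right) - \frac{1}{4} \left(\ln\det \Sigma_1 + \ln\det \Sigma_2\right). \tag{20}$$

The second term becomes

$$\frac{1}{4} \left(\ln\det \Sigma_1 + \ln\det \Sigma_2\right) = \frac{n}{2} \ln(\sigma^2) + \frac{1}{4} \sum_{i=1}^{d} \left[\ln\left(1 + \frac{\lambda_{1,i}}{\sigma^2}\right) + \ln\left(1 + \frac{\lambda_{2,i}}{\sigma^2}\right)\right], \tag{21}$$

and we use Lemma 5 to bound the first term. Note that

$$\frac{1}{2} \ln\det\left(\frac{\Sigma_1 + \Sigma_2}{2}\right) = \frac{1}{2} \ln\det\left(\sigma^2 I + \frac{U_1 \Lambda_1 U_1^{\top} + U_2 \Lambda_2 U_2^{\top}}{2}\right) = \frac{n}{2} \ln(\sigma^2) + \frac{1}{2} \ln\det\left(I + \frac{U_1 \Lambda_1 U_1^{\top} + U_2 \Lambda_2 U_2^{\top}}{2\sigma^2}\right). \tag{22}$$

Let $D = \frac{U_1 \Lambda_1 U_1^{\top} + U_2 \Lambda_2 U_2^{\top}}{2\sigma^2}$, and apply Lemma 5 to bound $\frac{1}{2} \ln\det\left(\frac{\Sigma_1 + \Sigma_2}{2}\right)$:

$$\frac{n}{2} \ln(\sigma^2) + \frac{1}{2} \left[\mathrm{tr}(D) - \frac{1}{2} \mathrm{tr}(D^2)\right] \le \frac{1}{2} \ln\det\left(\frac{\Sigma_1 + \Sigma_2}{2}\right) \le \frac{n}{2} \ln(\sigma^2) + \frac{1}{2} \left[\mathrm{tr}(D) - \frac{1}{4} \mathrm{tr}(D^2)\right]. \tag{23}$$

Expanding $\mathrm{tr}(D)$ gives

$$\frac{n}{2} \ln(\sigma^2) + \frac{1}{4} \left[\sum_{i=1}^{d} \frac{\lambda_{1,i}}{\sigma^2} + \sum_{i=1}^{d} \frac{\lambda_{2,i}}{\sigma^2}\right] - \frac{1}{4} \mathrm{tr}(D^2) \le \frac{1}{2} \ln\det\left(\frac{\Sigma_1 + \Sigma_2}{2}\right) \le \frac{n}{2} \ln(\sigma^2) + \frac{1}{4} \left[\sum_{i=1}^{d} \frac{\lambda_{1,i}}{\sigma^2} + \sum_{i=1}^{d} \frac{\lambda_{2,i}}{\sigma^2}\right] - \frac{1}{8} \mathrm{tr}(D^2). \tag{24}$$

Note that

$$\mathrm{tr}(D^2) = \frac{1}{4\sigma^4} \left(\sum_{i=1}^{d} \lambda_{1,i}^2 + \sum_{i=1}^{d} \lambda_{2,i}^2 + 2 \, \mathrm{tr}(U_1 \Lambda_1 U_1^{\top} U_2 \Lambda_2 U_2^{\top})\right).$$

Invoking Lemma 6 to bound the last term of the above:

$$\mathrm{tr}(U_1 \Lambda_1 U_1^{\top} U_2 \Lambda_2 U_2^{\top}) \ge \lambda_{1,d} \lambda_{2,d} \sum_i \cos^2 \theta_i, \tag{25}$$

$$\mathrm{tr}(U_1 \Lambda_1 U_1^{\top} U_2 \Lambda_2 U_2^{\top}) \le \lambda_{1,1} \lambda_{2,1} \sum_i \cos^2 \theta_i. \tag{26}$$

Combining Eqs. (20) to (26), we obtain upper and lower bounds on $K$.

1) Upper bound:

$$
\begin{aligned}
K &\le \frac{1}{4} \left[\sum_{i=1}^{d} \frac{\lambda_{1,i}}{\sigma^2} + \sum_{i=1}^{d} \frac{\lambda_{2,i}}{\sigma^2}\right] - \frac{1}{32\sigma^4} \left[\sum_{i=1}^{d} \lambda_{1,i}^2 + \sum_{i=1}^{d} \lambda_{2,i}^2 + 2 \lambda_{1,d} \lambda_{2,d} \sum_{i=1}^{d} \cos^2 \theta_i\right] \\
&\quad - \frac{1}{4} \sum_{i=1}^{d} \left[\ln\left(1 + \frac{\lambda_{1,i}}{\sigma^2}\right) + \ln\left(1 + \frac{\lambda_{2,i}}{\sigma^2}\right)\right] \\
&= C_2 - \frac{1}{16\sigma^4} \lambda_{1,d} \lambda_{2,d} \sum_{i=1}^{d} \cos^2 \theta_i.
\end{aligned}
\tag{27}
$$

2) Lower bound:

$$
\begin{aligned}
K &\ge \frac{1}{4} \left[\sum_{i=1}^{d} \frac{\lambda_{1,i}}{\sigma^2} + \sum_{i=1}^{d} \frac{\lambda_{2,i}}{\sigma^2}\right] - \frac{1}{16\sigma^4} \left[\sum_{i=1}^{d} \lambda_{1,i}^2 + \sum_{i=1}^{d} \lambda_{2,i}^2 + 2 \lambda_{1,1} \lambda_{2,1} \sum_{i=1}^{d} \cos^2 \theta_i\right] \\
&\quad - \frac{1}{4} \sum_{i=1}^{d} \left[\ln\left(1 + \frac{\lambda_{1,i}}{\sigma^2}\right) + \ln\left(1 + \frac{\lambda_{2,i}}{\sigma^2}\right)\right] \\
&= C_3 - \frac{1}{8\sigma^4} \lambda_{1,1} \lambda_{2,1} \sum_{i=1}^{d} \cos^2 \theta_i.
\end{aligned}
\tag{28}
$$

Here $C_2$ and $C_3$ collect the terms that do not depend on the principal angles. Negating $K$ and exponentiating gives the theorem.

□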

Appendix C 

Proof of Moderate SNR Case 
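Proof of Lemma 1. Consider the function

$$f(\lambda_i) = \ln(1 + \lambda_i) - \ln(1 + \rho) - \frac{1}{1 + \rho} (\lambda_i - \rho) + \frac{1}{(1 + \rho)^2} (\lambda_i - \rho)^2,$$

defined on $[0, \rho]$. Its derivative is

$$f'(\lambda_i) = \frac{1}{1 + \lambda_i} - \frac{1}{1 + \rho} + \frac{2(\lambda_i - \rho)}{(1 + \rho)^2} = \frac{(\rho - \lambda_i)(\rho - 1 - 2\lambda_i)}{(1 + \lambda_i)(1 + \rho)^2},$$

which is positive in $\left[0, \frac{\rho - 1}{2}\right)$ and negative in $\left(\frac{\rho - 1}{2}, \rho\right)$. Therefore, $f(\lambda_i)$ is monotonically increasing in $\left[0, \frac{\rho - 1}{2}\right]$ and decreasing in $\left[\frac{\rho - 1}{2}, \rho\right]$. Further, $f(\rho) = 0$ and

$$f(0) = -\ln(1 + \rho) + \frac{\rho}{1 + \rho} + \frac{\rho^2}{(1 + \rho)^2},$$

whose sign depends on the value of $\rho$. The shape of $f(\lambda_i)$ is now characterized. There exists $L < \frac{\rho - 1}{2}$ such that $f(\lambda_i) \ge 0$ when $\lambda_i \in [L, \rho]$.

□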

Before proving the theorem, we need to bound $\lambda_i$ using Weyl's inequality.

Lemma 7 (Weyl's inequality). Let $M$ and $P$ be two $n \times n$ Hermitian matrices, with eigenvalues $\mu_1 \ge \cdots \ge \mu_n$ and $\nu_1 \ge \cdots \ge \nu_n$ respectively. Denote the eigenvalues of $M + P$ by $\gamma_1 \ge \cdots \ge \gamma_n$. Then

$$\max(\mu_i + \nu_n, \nu_i + \mu_n) \le \gamma_i \le \min(\mu_i + \nu_1, \nu_i + \mu_1).$$
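Weyl's inequality as stated in Lemma 7 is easy to verify numerically. The sketch below uses random real symmetric matrices (a special case of Hermitian); the size, seed and the helper name `random_symmetric` are assumptions for illustration.

```python
import numpy as np

def random_symmetric(rng, n):
    """Random real symmetric (hence Hermitian) n x n matrix."""
    b = rng.standard_normal((n, n))
    return (b + b.T) / 2

rng = np.random.default_rng(3)
n = 6
m = random_symmetric(rng, n)
p = random_symmetric(rng, n)

# eigvalsh returns ascending eigenvalues; reverse to match mu_1 >= ... >= mu_n.
mu = np.linalg.eigvalsh(m)[::-1]
nu = np.linalg.eigvalsh(p)[::-1]
gamma = np.linalg.eigvalsh(m + p)[::-1]

lower = np.maximum(mu + nu[-1], nu + mu[-1])
upper = np.minimum(mu + nu[0], nu + mu[0])

for i in range(n):
    print(f"{lower[i]: .3f} <= {gamma[i]: .3f} <= {upper[i]: .3f}")
```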

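Proof of the theorem (moderate SNR case). Since […] $\le \rho$, by Weyl's inequality, […] $\le \lambda_i \le$ […] $= \rho$. Further, since $1 < c(\rho) <$ […], we have $\lambda_1, \ldots, \lambda_{2d-r} \in [L(\rho), \rho]$. By definition of $L(\rho)$, we can invoke the inequality in Lemma 1 to obtain

$$
\begin{aligned}
\ln\det\left(\frac{\Sigma_1 + \Sigma_2}{2}\right) &= \sum_{i=1}^{2d-r} \ln(1 + \lambda_i) + n \ln(\sigma^2) \\
&\ge (2d - r) \ln(1 + \rho) + \frac{\mathrm{tr}\, D - \rho (2d - r)}{1 + \rho} - \frac{\mathrm{tr}\, D^2 - 2\rho \, \mathrm{tr}\, D + \rho^2 (2d - r)}{(1 + \rho)^2} + n \ln(\sigma^2).
\end{aligned}
\tag{29}
$$

Notice that $\mathrm{tr}\, D = \frac{1}{2\sigma^2} \sum_{i=1}^{d} (\lambda_{1,i} + \lambda_{2,i})$, and by Eqs. (25) and (26),

$$\mathrm{tr}\, D^2 \le \frac{1}{4\sigma^4} \left(\sum_{i=1}^{d} \lambda_{1,i}^2 + \sum_{i=1}^{d} \lambda_{2,i}^2 + 2 \lambda_{1,1} \lambda_{2,1} \sum_{i=1}^{d} \cos^2 \theta_i\right).$$

Substituting these into Eq. (29), we get

$$
\begin{aligned}
\ln\det\left(\frac{\Sigma_1 + \Sigma_2}{2}\right) &\ge n \ln(\sigma^2) + (2d - r) \left[\ln(1 + \rho) - \frac{\rho}{1 + \rho} - \frac{\rho^2}{(1 + \rho)^2}\right] + \frac{1 + 3\rho}{2\sigma^2 (1 + \rho)^2} \sum_{i=1}^{d} (\lambda_{1,i} + \lambda_{2,i}) \\
&\quad - \frac{1}{4\sigma^4 (1 + \rho)^2} \left[\sum_{i=1}^{d} \left(\lambda_{1,i}^2 + \lambda_{2,i}^2\right) + 2 \lambda_{1,1} \lambda_{2,1} \sum_{i=1}^{d} \cos^2 \theta_i\right].
\end{aligned}
$$

Substituting the above into the Bhattacharyya bound yields an upper bound on $P_e$ of the form given in the theorem. In particular,

$$C_4 = \ln(1 + \rho) - \frac{\rho}{1 + \rho} - \frac{\rho^2}{(1 + \rho)^2}$$

and

$$C_5 = \frac{1 + 3\rho}{4\sigma^2 (1 + \rho)^2} \sum_{i=1}^{d} (\lambda_{1,i} + \lambda_{2,i}) - \frac{1}{4} \sum_{i=1}^{d} \left[\ln\left(1 + \frac{\lambda_{1,i}}{\sigma^2}\right) + \ln\left(1 + \frac{\lambda_{2,i}}{\sigma^2}\right)\right] - \frac{1}{8\sigma^4 (1 + \rho)^2} \sum_{i=1}^{d} \left(\lambda_{1,i}^2 + \lambda_{2,i}^2\right).$$

□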


Appendix D 
Analysis of NSC 

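Proof of the lemma on pairwise independence. Since the joint distributions of $[a_i \;\; a_j]^{\top}$, $[b_i \;\; b_j]^{\top}$, $[a_i \;\; b_j]^{\top}$ and $[a_i + b_i \;\; a_i - b_i]^{\top}$ are all Gaussian, it suffices to show that all covariances are diagonal. For any $i \ne j$,

$$\begin{bmatrix} a_i \\ a_j \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \alpha_i \\ \alpha_j \end{bmatrix}, \sigma^2 I\right), \quad \begin{bmatrix} b_i \\ b_j \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \cos\theta_i \, \alpha_i \\ \cos\theta_j \, \alpha_j \end{bmatrix}, \sigma^2 I\right), \quad \begin{bmatrix} a_i \\ b_j \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \alpha_i \\ \cos\theta_j \, \alpha_j \end{bmatrix}, \sigma^2 I\right).$$

For any $i$,

$$
\begin{aligned}
\begin{bmatrix} a_i \\ b_i \end{bmatrix} &\sim \mathcal{N}\left(\begin{bmatrix} \alpha_i \\ \cos\theta_i \, \alpha_i \end{bmatrix}, \sigma^2 \begin{bmatrix} 1 & \cos\theta_i \\ \cos\theta_i & 1 \end{bmatrix}\right), \\
\begin{bmatrix} a_i + b_i \\ a_i - b_i \end{bmatrix} &\sim \mathcal{N}\left(\begin{bmatrix} (1 + \cos\theta_i) \alpha_i \\ (1 - \cos\theta_i) \alpha_i \end{bmatrix}, 2\sigma^2 \begin{bmatrix} 1 + \cos\theta_i & 0 \\ 0 & 1 - \cos\theta_i \end{bmatrix}\right),
\end{aligned}
\tag{30}
$$

which concludes the proof. □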

Proof of the lemma on the limiting distribution. As $\sigma^2 \to 0$, the mean-covariance ratios of both $a_i + b_i$ and $a_i - b_i$ tend to infinity. Therefore, applying the corresponding lemma to Eq. (30) (see the preceding proof), we have

$$(a_i + b_i)(a_i - b_i) \to \mathcal{N}\left(\sin^2 \theta_i \, \alpha_i^2, \; 4\sigma^2 \sin^2 \theta_i \, (\alpha_i^2 + \sigma^2)\right).$$

Applying the independence between $(a_i + b_i)(a_i - b_i)$ and $(a_j + b_j)(a_j - b_j)$ ($i \ne j$), we obtain the desired result by summing the mean and variance over all $i$. □
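The mean and variance appearing in this limit can be checked by simulation. The sketch below is a Monte Carlo sanity check under assumed values ($\theta_i = 60°$, $\alpha_i = 1$, $\sigma = 0.5$, 400000 draws); it samples $(a_i, b_i)$ from the joint distribution in Eq. (30) and compares the empirical moments of the product with the stated formulas. It checks the two moments only, not Gaussianity.

```python
import numpy as np

theta, alpha, sigma = np.deg2rad(60.0), 1.0, 0.5  # assumed illustrative values
c = np.cos(theta)

rng = np.random.default_rng(4)
mean = np.array([alpha, c * alpha])
cov = sigma ** 2 * np.array([[1.0, c], [c, 1.0]])

# Draw (a_i, b_i) jointly, then form the product (a_i + b_i)(a_i - b_i).
ab = rng.multivariate_normal(mean, cov, size=400_000)
prod = (ab[:, 0] + ab[:, 1]) * (ab[:, 0] - ab[:, 1])

mean_formula = np.sin(theta) ** 2 * alpha ** 2
var_formula = 4 * sigma ** 2 * np.sin(theta) ** 2 * (alpha ** 2 + sigma ** 2)

print(prod.mean(), mean_formula)
print(prod.var(), var_formula)
```

The two moment formulas hold for any $\sigma$ because the sum and difference are independent Gaussians; the small-$\sigma$ limit is needed only for the product to be approximately normal.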

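Proof of the theorem (NSC error bound). We prove the theorem by deriving upper bounds on $\Pr(C_2 | C_1, \alpha)$ and $\Pr(C_1 | C_2, \alpha)$.

$$
\begin{aligned}
\Pr(C_2 | C_1, \alpha) &= \Pr\left(\sum_i (a_i + b_i)(a_i - b_i) < 0\right) \\
&= \Pr\left(\frac{\sum_i (a_i + b_i)(a_i - b_i) - \sum_i \sin^2 \theta_i \, \alpha_i^2}{2\sigma \sqrt{\sum_i \sin^2 \theta_i \, (\alpha_i^2 + \sigma^2)}} < - \frac{\sum_i \sin^2 \theta_i \, \alpha_i^2}{2\sigma \sqrt{\sum_i \sin^2 \theta_i \, (\alpha_i^2 + \sigma^2)}}\right).
\end{aligned}
\tag{31}
$$

As $\sigma \to 0$, the term to the left of "<" in the last line of Eq. (31) is standard normal distributed. Therefore we can invoke the Gaussian tail bound to obtain

$$
\begin{aligned}
\Pr(C_2 | C_1, \alpha) &= \Pr\left(\frac{\sum_i (a_i + b_i)(a_i - b_i) - \sum_i \sin^2 \theta_i \, \alpha_i^2}{2\sigma \sqrt{\sum_i \sin^2 \theta_i \, (\alpha_i^2 + \sigma^2)}} > \frac{\sum_i \sin^2 \theta_i \, \alpha_i^2}{2\sigma \sqrt{\sum_i \sin^2 \theta_i \, (\alpha_i^2 + \sigma^2)}}\right) \\
&\le \frac{1}{2} \exp\left(- \frac{\left(\sum_i \sin^2 \theta_i \, \alpha_i^2\right)^2}{8\sigma^2 \sum_i \sin^2 \theta_i \, (\alpha_i^2 + \sigma^2)}\right).
\end{aligned}
\tag{32}
$$

$\Pr(C_1 | C_2, \alpha)$ can be upper bounded in the same manner:

$$\Pr(C_1 | C_2, \alpha) \le \frac{1}{2} \exp\left(- \frac{\left(\sum_i \sin^2 \theta_i \, \alpha_i^2\right)^2}{8\sigma^2 \sum_i \sin^2 \theta_i \, (\alpha_i^2 + \sigma^2)}\right). \tag{33}$$

Therefore,

$$
\begin{aligned}
P_e &= \frac{1}{2} \int \Pr(C_2 | C_1, \alpha) \, p(\alpha) \, d\alpha + \frac{1}{2} \int \Pr(C_1 | C_2, \alpha) \, q(\alpha) \, d\alpha \\
&\le \int \frac{1}{2} \exp\left(- \frac{\left(\sum_i \sin^2 \theta_i \, \alpha_i^2\right)^2}{8\sigma^2 \sum_i \sin^2 \theta_i \, (\alpha_i^2 + \sigma^2)}\right) \frac{p(\alpha) + q(\alpha)}{2} \, d\alpha \\
&= \int \mathcal{E}(\theta, \alpha, \sigma) \, \frac{p(\alpha) + q(\alpha)}{2} \, d\alpha,
\end{aligned}
\tag{34}
$$

which concludes the proof.

□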


Appendix E 

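Proof of Proposition 1. Observe that

$$\|X^{\top} P X - T\|_F^2 = \|(X^{\top} \otimes X^{\top}) \, \mathrm{vec}(P) - \mathrm{vec}(T)\|_2^2$$

is a least squares problem with minimizer

$$\mathrm{vec}(P^*) = (X^{\top} \otimes X^{\top})^{\dagger} \, \mathrm{vec}(T) = \left((X^{\top})^{\dagger} \otimes (X^{\top})^{\dagger}\right) \mathrm{vec}(T),$$

which can be rearranged to give

$$P^* = (X^{\top})^{\dagger} \, T \, [(X^{\top})^{\dagger}]^{\top} = (X X^{\top})^{-1} X T X^{\top} (X X^{\top})^{-1} \succeq 0.$$

□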
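The closed form for $P^*$ can be compared against a generic least-squares solve of the vectorized problem. The sketch below is a sanity check under assumptions made here: $X$ is a random $n \times N$ matrix with full row rank, $T$ is a random positive semi-definite target, and the sizes are arbitrary. Column-major vectorization (`order="F"`) is used so that $\mathrm{vec}(X^{\top} P X) = (X^{\top} \otimes X^{\top}) \mathrm{vec}(P)$ holds.

```python
import numpy as np

rng = np.random.default_rng(5)
n, num = 4, 9

x = rng.standard_normal((n, num))   # full row rank with probability one
t0 = rng.standard_normal((num, num))
t = t0 @ t0.T                       # positive semi-definite target (assumed)

# Closed form: P* = (X X^T)^{-1} X T X^T (X X^T)^{-1}.
gram_inv = np.linalg.inv(x @ x.T)
p_closed = gram_inv @ x @ t @ x.T @ gram_inv

# Generic least squares on the vectorized problem.
a = np.kron(x.T, x.T)               # shape (num^2, n^2)
p_vec, *_ = np.linalg.lstsq(a, t.reshape(-1, order="F"), rcond=None)
p_lstsq = p_vec.reshape((n, n), order="F")

print(np.max(np.abs(p_closed - p_lstsq)))
print(np.linalg.eigvalsh((p_closed + p_closed.T) / 2).min())
```

The first printed value should be near machine precision, and the smallest eigenvalue should be nonnegative up to rounding, consistent with $P^* \succeq 0$.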